24 research outputs found

    On a Near Optimal Work-Stealing Tree Data-Parallel Scheduler for Highly Irregular Workloads

    We present a work-stealing algorithm for runtime scheduling of data-parallel operations in the context of shared-memory architectures on data sets with highly irregular workloads that are not known a priori to the scheduler. This scheduler can parallelize loops and operations expressible with a parallel reduce or a parallel scan. The scheduler is based on the work-stealing tree data structure, which allows workers to decide on the work division in a lock-free, workload-driven manner and attempts to minimize the amount of communication between them. Significant effort is devoted to showing that the algorithm has the least possible amount of overhead. We provide an extensive experimental evaluation, comparing the advantages and shortcomings of different data-parallel schedulers in order to combine their strengths. We show specific workload distribution patterns appearing in practice for which different schedulers yield suboptimal speedup, explaining their drawbacks and demonstrating how the work-stealing tree scheduler overcomes them. We thus justify our design decisions experimentally, but also provide a theoretical background for our claims.
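    The idea of workers dividing a range among themselves on demand can be illustrated with a minimal Python sketch. It is not the scheduler from the paper: a lock stands in for the lock-free compare-and-swap, a flat list of nodes replaces the work-stealing tree, and the names `RangeNode`, `take_batch`, `steal_half`, and `parallel_reduce` are hypothetical. Because a worker may combine non-contiguous ranges, this simplified version assumes the reduction operator is both associative and commutative.

```python
import threading


class RangeNode:
    """A range [progress, end) of unclaimed indices; a lock stands in for CAS."""

    def __init__(self, start, end):
        self.progress = start
        self.end = end
        self.lock = threading.Lock()

    def take_batch(self, size):
        """Owner claims up to `size` indices; returns (lo, hi) or None if empty."""
        with self.lock:
            lo = self.progress
            hi = min(lo + size, self.end)
            if lo >= hi:
                return None
            self.progress = hi
            return lo, hi

    def steal_half(self):
        """Thief takes the upper half of the unclaimed remainder, or None."""
        with self.lock:
            remaining = self.end - self.progress
            if remaining < 2:
                return None
            mid = self.progress + remaining // 2
            stolen = RangeNode(mid, self.end)
            self.end = mid
            return stolen


def parallel_reduce(data, op, zero, workers=4):
    """Reduce `data` with `op` (assumed associative and commutative)."""
    n = len(data)
    chunk = (n + workers - 1) // workers
    nodes = [RangeNode(min(i * chunk, n), min((i + 1) * chunk, n))
             for i in range(workers)]
    partials = [zero] * workers

    def run(i):
        acc = zero
        while True:
            batch = 1
            # Work on the owned node, doubling the batch size after each claim.
            while (claimed := nodes[i].take_batch(batch)) is not None:
                lo, hi = claimed
                for k in range(lo, hi):
                    acc = op(acc, data[k])
                batch *= 2
            # Own node is empty: try to steal unclaimed work from another worker.
            for j in range(workers):
                stolen = nodes[j].steal_half() if j != i else None
                if stolen is not None:
                    nodes[i] = stolen
                    break
            else:
                break  # nothing left to steal
        partials[i] = acc

    threads = [threading.Thread(target=run, args=(i,)) for i in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    result = zero
    for p in partials:
        result = op(result, p)
    return result
```

    Calling `parallel_reduce(values, lambda a, b: a + b, 0)` should give the same result as `sum(values)`. Each index is claimed exactly once, either by the owner through `take_batch` or by a thief through `steal_half`.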

    Achieving Efficient Work-Stealing for Data-Parallel Collections

    In modern programming, high-level data structures are an important foundation for most applications. With the rise of the multi-core era, there is a growing trend of supporting data-parallel collection operations in general purpose programming languages and platforms. To facilitate object-oriented reuse, these operations are highly parametric, incurring abstraction performance penalties. Furthermore, data-parallel operations must scale when used in problems with irregular workloads. Work-stealing is a proven load-balancing technique when it comes to irregular workloads, but general purpose work-stealing also suffers from abstraction penalties. In this paper we present a generic design of a data-parallel collections framework based on work-stealing for shared-memory architectures. We show how abstraction penalties can be overcome through callsite specialization of data-parallel operation instances. Moreover, we show how to make work-stealing fine-grained and efficient when specialized for particular data structures. We experimentally validate the performance of different data structures and data-parallel operations, achieving up to 60X better performance with abstraction penalties eliminated and 3X higher speedups by specializing work-stealing compared to existing approaches.

    Isolates, channels, and event streams for composable distributed programming

    The actor model has been a model of choice for building reliable distributed systems. On the one hand, it ensures that message processing is serialized within each actor, preserving the familiar sequential programming model. On the other hand, programs written in the actor model are location-transparent. The model is sufficiently low-level to express arbitrary message protocols. Composing these protocols is the key to high-level abstractions. Unfortunately, it is difficult to reuse or compose message protocols with actors. Reactive isolates, proposed in this paper, simplify protocol composition with first-class typed channels and event streams. We compare reactive isolates and the actor model on concrete programs. We identify obstacles for composition in the classic actor model, and show how to overcome them. We then show how to build reusable, composable distributed computing components in the new model.

    Duet Benchmarking: Improving Measurement Accuracy in the Cloud

    We investigate the duet measurement procedure, which helps improve the accuracy of performance comparison experiments conducted on shared machines by executing the measured artifacts in parallel and evaluating their relative performance together, rather than individually. Specifically, we analyze the behavior of the procedure in multiple cloud environments and use experimental evidence to answer multiple research questions concerning the assumption underlying the procedure. We demonstrate improvements in accuracy ranging from 2.3x to 12.5x (5.03x on average) for the tested ScalaBench (and DaCapo) workloads, and from 23.8x to 82.4x (37.4x on average) for the SPEC CPU 2017 workloads.
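    The intuition behind evaluating two artifacts together can be shown with a small Python sketch. It assumes that interference on a shared machine scales both concurrently running artifacts by roughly the same factor, so the factor cancels in the per-pair ratio. The function names, the choice of geometric mean as the summary statistic, and the sample numbers are illustrative assumptions, not taken from the paper.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def duet_ratio(times_a, times_b):
    """Relative performance of A versus B from paired measurements.

    times_a[i] and times_b[i] are assumed to come from the same duet run,
    i.e. both artifacts executed in parallel on the same machine.
    """
    ratios = [a / b for a, b in zip(times_a, times_b)]
    return geometric_mean(ratios)


# Hypothetical data: A truly takes 2.0 s, B takes 1.0 s, and each duet run
# is slowed down by a shared interference factor.
noise = [1.0, 1.5, 3.0, 1.2]
times_a = [2.0 * f for f in noise]
times_b = [1.0 * f for f in noise]

paired = duet_ratio(times_a, times_b)

# For contrast: if A and B were measured in separate runs, they would see
# different interference, and the ratio of their means would be distorted.
sequential = (sum(times_a[:2]) / 2) / (sum(times_b[2:]) / 2)
```

    Under the shared-interference assumption, `paired` recovers the true ratio of 2.0 up to floating-point rounding, while `sequential` depends on which runs happened to be noisy.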

    Containers and Aggregates, Mutators and Isolates for Reactive Programming

    Many programs have an inherently reactive nature imposed by the functional dependencies between their data and external events. Classically, these dependencies are dealt with using callbacks. Reactive programming with first-class reactive values is a paradigm that aims to encode callback logic in declarative statements. Reactive values concisely define dependencies between singular data elements, but cannot efficiently express dependencies in larger datasets. Orthogonally, embedding reactive values in a shared-memory concurrency model complicates their semantics and requires synchronization. This paper presents a generic framework for reactive programming that extends first-class reactive values with the concept of lazy reactive containers, backed by several concrete implementations. Our framework addresses concurrency by introducing reactive isolates. We show through examples that our programming model is efficient and convenient to use.

    On Lock-Free Work-stealing Iterators for Parallel Data Structures

    With the rise of multicores, there is a trend of supporting data-parallel collection operations in general purpose programming languages. These operations are highly parametric, incurring abstraction performance penalties. Furthermore, data-parallel operations must scale when applied to irregular workloads. Work-stealing is a proven technique for load balancing irregular workloads, but general purpose work-stealing also suffers abstraction penalties. We present a generic data-parallel collections design based on work-stealing for shared-memory architectures that overcomes abstraction penalties through callsite specialization of data-parallel operation instances. Moreover, we introduce work-stealing iterators that allow fine-grained and efficient work-stealing for particular data structures. By eliminating abstraction penalties and making work-stealing data-structure-aware, we achieve up to 60x better performance compared to JVM-based approaches and 3x speedups compared to tools such as Intel TBB.

    Cache-Aware Lock-Free Concurrent Hash Tries

    This report describes an implementation of a non-blocking concurrent shared-memory hash trie based on single-word compare-and-swap instructions. Insert, lookup and remove operations modifying different parts of the hash trie can be run independently of each other and do not contend. Remove operations ensure that the unneeded memory is freed and that the trie is kept compact. Pseudocode for these operations is presented and a proof of correctness is given: we show that the implementation is linearizable and lock-free. Finally, benchmarks are presented which compare concurrent hash trie operations against the corresponding operations on other concurrent data structures, showing their performance and scalability.